3: Data Visualization and Data Manipulation

BSTA 526: R Programming for Health Data Science

Author: Meike Niederhausen, PhD & Jessica Minnier, PhD

Affiliation: OHSU-PSU School of Public Health

Published: January 22, 2026

Modified: January 22, 2026

1 Welcome to R Programming: Part 3!

In this session, we will cover the basics of data wrangling with dplyr functions and some options for customizing ggplots.


Before you get started:

Remember to save this notebook under a new name, such as part_03_b526_YOURNAME.qmd.

1.1 Learning Objectives

By the end of this session, you should be able to:

  1. Distinguish errors from warnings and know where to ask for help
  2. Manage files and load data with more confidence
  3. Sort by a variable in a dataset using arrange()
  4. Select variables in a dataset using select()
  5. Filter a dataset using the filter() function
  6. Create and modify scatterplots and boxplots
  7. Split figures into multiple panels using facet_wrap()
  8. Customize your plots using built-in theme()s

2 Getting Help on Errors

2.1 Understanding the difference between warnings and errors

  • A warning is an indication that the data or arguments aren’t quite what the function expected.
    • You can usually run the code, but you should be careful about it and verify the output.


  • An error means that the code can’t run at all with the inputs you have given the function.
    • Errors can be difficult to understand, which is why…
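
To make the distinction concrete, here is a minimal sketch using base R only. The object names x and result are made up for illustration; they are not part of the course data.

```r
# A warning: the code still runs, but R flags something unexpected.
# "three" cannot be converted to a number, so it becomes NA.
x <- as.numeric(c("1", "2", "three"))
x

# An error: the code cannot run, so no value is returned.
# tryCatch() is used here only so that the script keeps going
# and we can look at the error message as text.
result <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)
result
```

The first call returns a value that you should double-check. The second would stop your code at that line if it were not wrapped in tryCatch().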

2.2 Googling is StandaRd pRactice foR eRRors

The first thing I do when I encounter an error that I don’t understand is to search the internet for the error.

There are some resources that I especially check (in order):

  • Search “question + rcran” (e.g. “hist rcran” or “make a boxplot ggplot”)
  • Search for the error message in quotes (e.g. “Evaluation error: invalid type (closure) for variable ‘***’”)
  • AI tools such as Claude, Perplexity, ChatGPT, etc.
  • More advanced/specific searches

2.3 How to find examples and information on function arguments

  • Read the vignettes! (or user guide if there is one)
  • Read the help documentation (sometimes not that useful)
  • Look at cheatsheets: https://posit.co/resources/cheatsheets/
  • Search the internet as above, or use AI.
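
As a rough sketch, these are some ways to reach the help documentation and vignettes from the R console. The choice of dplyr and arrange() as the example is an assumption made for illustration, and the vignette name "dplyr" assumes the package's introductory vignette is installed along with it.

```r
library(dplyr)

# The help viewers below are meant for interactive use (e.g. in RStudio),
# so they are skipped when the code is run as a script.
if (interactive()) {
  ?arrange                           # help page for a function
  help("filter", package = "dplyr")  # help page, naming the package explicitly
  vignette("dplyr")                  # the introductory dplyr vignette
  browseVignettes("dplyr")           # list all vignettes for a package
}

# args() shows a function's arguments and works in any session
arrange_args <- args(dplyr::arrange)
arrange_args
```

Naming the package in help() is useful when several loaded packages have a function with the same name, as is the case for filter().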

3 Revisiting data loading and file management

Useful resources from last time:

3.1 Challenge 1 (10 minutes)

  1. Create a new .qmd file and name it smoke_messy.qmd. Save it in the “part3” folder.
  2. Load data from the file data/smoke_complete.xlsx from the second tab/sheet “smoke_messy” using the read_excel() function and the appropriate arguments. Remember to look at the data first! Remember to load your packages! Name the data smoke_messy.
  3. Use glimpse() and View() to view your data and observe the column names and column types, making sure numeric columns are numeric (class dbl).

Discussion questions

4 Data Manipulation using dplyr

  • We’re going to start off our data work with some data manipulation using the tidyverse package: dplyr.
  • dplyr is your all-purpose toolbox for filtering, summarizing, and transforming data.

4.1 Common dplyr “verbs”

dplyr uses common verbs to wrangle data

Examples:

  • arrange() - sorts a dataset by a variable
  • filter() - subsets a dataset by criteria
  • select() - returns only a few columns from a dataset
  • group_by()/summarize() - summarizes a dataset, such as counting frequencies and calculating means
  • mutate() - transforms variables in a dataset


  • %>% - the pipe function lets us join “verbs” together in a sequence of operations that transform a dataset.
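
To give a feel for how these verbs fit together, here is a minimal sketch on a small made-up table. The tibble toy and the columns cigarettes_per_week and mean_per_week are hypothetical and only loosely modeled on the course data.

```r
library(dplyr)

# Hypothetical toy data, not the smoke_complete dataset
toy <- tibble(
  gender             = c("male", "female", "male", "female"),
  cigarettes_per_day = c(2.5, 0.5, 1.0, 3.0),
  disease            = c("LUSC", "CESC", "BLCA", "LUSC")
)

toy_summary <- toy %>%
  filter(cigarettes_per_day >= 1) %>%                        # keep some rows
  mutate(cigarettes_per_week = cigarettes_per_day * 7) %>%   # add a new column
  select(gender, cigarettes_per_week) %>%                    # keep some columns
  group_by(gender) %>%                                       # one group per gender
  summarize(mean_per_week = mean(cigarettes_per_week)) %>%   # one row per group
  arrange(mean_per_week)                                     # sort the result

toy_summary
```

Each line does one small job, and the pipe hands the result to the next line.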

4.2 Getting set up: load the smoke_complete dataset

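library(tidyverse) # dplyr, ggplot2, and the %>% pipe
library(readxl)    # read_excel()
library(here)      # here()
library(janitor)   # tabyl()

# load the first tab smoke_complete
smoke_complete <- read_excel(
  here("part3", "data", "smoke_complete.xlsx"),
  sheet = 1,
  na = "NA"
  )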

# Remove some columns for easier viewing - we'll talk about the select function later

smoke_complete <- smoke_complete %>% 
  select(age_at_diagnosis, 
         tumor_stage,
         cigarettes_per_day, 
         gender, 
         vital_status, 
         disease)

5 Introducing the pipe (%>% or |>)

5.1 Connect multiple operations with %>%

  • Often we want to do multiple operations on our data and in a specific order.
  • For example, I might want to do the following:
    • Take my dataset smoke_complete and then
    • Sort it by cigarettes_per_day and then
    • filter to have only males from the data.

The pipe (%>%) function acts like the “and then”:

# Take my dataset smoke_complete **and then**
smoke_complete %>%

# sort it by cigarettes_per_day **and then**
  arrange(cigarettes_per_day) %>%

# filter it to only have males
  filter(gender == "male")
# A tibble: 786 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii              0.00822 male   dead         BLCA   
 2            29288 stage iii              0.0548  male   dead         BLCA   
 3            18983 stage ii               0.0822  male   dead         BLCA   
 4            22632 stage iiia             0.110   male   dead         LUSC   
 5            22632 stage iiia             0.110   male   dead         LUSC   
 6            20632 stage iib              0.110   male   alive        LUSC   
 7            20632 stage iib              0.110   male   alive        LUSC   
 8            25579 stage iii              0.110   male   dead         BLCA   
 9            23156 stage i                0.137   male   dead         LUSC   
10            23156 stage i                0.137   male   dead         LUSC   
# ℹ 776 more rows

5.2 %>%: output of one step is the input of another

  • You can think of a pipe as putting the output of one step as an input of another.
  • These two statements are equivalent:
smoke_complete %>%
  arrange(cigarettes_per_day)
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,142 more rows

and

arrange(smoke_complete, cigarettes_per_day)
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,142 more rows

5.3 Why pipes?

  • Pipes make it easier to put together commands.
  • Without pipes, I’d have to do the following:
arrange(
  filter(smoke_complete, 
         gender == "male"),
  cigarettes_per_day
)
# A tibble: 786 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii              0.00822 male   dead         BLCA   
 2            29288 stage iii              0.0548  male   dead         BLCA   
 3            18983 stage ii               0.0822  male   dead         BLCA   
 4            22632 stage iiia             0.110   male   dead         LUSC   
 5            22632 stage iiia             0.110   male   dead         LUSC   
 6            20632 stage iib              0.110   male   alive        LUSC   
 7            20632 stage iib              0.110   male   alive        LUSC   
 8            25579 stage iii              0.110   male   dead         BLCA   
 9            23156 stage i                0.137   male   dead         LUSC   
10            23156 stage i                0.137   male   dead         LUSC   
# ℹ 776 more rows
  • You can see that as you add more and more verbs, it gets more and more complicated.
  • We end up with deeply nested parentheses. Or with many intermediate steps like this:
smoke_complete_new <- filter(smoke_complete, gender == "male")
smoke_complete_new <- arrange(smoke_complete_new, cigarettes_per_day)

That’s what pipes are supposed to save us from.

5.4 One Step at a Time

  • One big advantage of the pipe is that you can build your processing line by line.
  • In practice, as I work, I will often pipe things into View() to confirm I did things correctly:
smoke_complete %>%
  arrange(cigarettes_per_day) %>%
  View()
  • This is a good way to work. By building each step and verifying that the output is correct, we can build a really long sequence of processing steps.
  • Even better is to include checks in your code to verify that your work is correct.
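
One possible way to write such checks is with base R's stopifnot(), which stops with an error if any condition is not TRUE. This is only a sketch: the tibble toy is made up for illustration, and which checks make sense depends on your own data.

```r
library(dplyr)

# Hypothetical toy data, not the smoke_complete dataset
toy <- tibble(
  id                 = 1:4,
  cigarettes_per_day = c(3, 0.5, 2, 1)
)

toy_sorted <- toy %>%
  arrange(cigarettes_per_day)

# Check 1: sorting should not add or drop rows
stopifnot(nrow(toy_sorted) == nrow(toy))

# Check 2: the column should now be in ascending order
stopifnot(!is.unsorted(toy_sorted$cigarettes_per_day))
```

If both checks pass, nothing is printed and the code simply continues.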

IMPORTANT POINT:

  • The code above has not saved our work.
  • The data frame object has not changed, and the Excel file has not changed either.
  • To save your data frame after wrangling/cleaning/arranging it, save it as the same or another object:
# replaces previous version with new sorted version
smoke_complete <- smoke_complete %>%
  arrange(cigarettes_per_day)

# creates a new object
smoke_complete_sorted <- smoke_complete %>%
  arrange(cigarettes_per_day)

# look at your Environment tab

5.5 The difference between + and %>%

  • Remember
    • that + adds layers to a ggplot2 plot and
    • that %>% passes data from one function to the next (for example, between dplyr verbs).
  • To keep them distinct and avoid confusion, I recommend keeping data processing and plotting steps separate.
  • You can chain them, but it can get confusing.
# keeping data processing and plotting steps separate
smoke_complete_sorted <- smoke_complete %>%
  arrange(cigarettes_per_day)

ggplot(data = smoke_complete_sorted) +
  aes(x = cigarettes_per_day) +
  geom_histogram()
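
For comparison, here is a sketch of what chaining the two looks like. The tibble toy is made up for illustration, and bins = 5 is an arbitrary choice for this small example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, not the smoke_complete dataset
toy <- tibble(
  gender             = c("male", "female", "male", "female"),
  cigarettes_per_day = c(2.5, 0.5, 1.0, 3.0)
)

p <- toy %>%
  filter(gender == "male") %>%            # data steps are joined with %>%
  ggplot(aes(x = cigarettes_per_day)) +   # from ggplot() onward, use +
  geom_histogram(bins = 5)

p
```

The switch from %>% to + happens at the ggplot() line. Using %>% after ggplot() instead of + is a common source of errors, which is one reason to keep the two steps separate.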

5.6 Base R pipe |>

  • R version 4.1.0 introduced the “native” pipe operator |>
    • that comes as part of the R syntax without the need for loading additional packages (e.g. the tidyverse).
  • Learn more about this at the Tidyverse website and the pipe section of R for Data Science.

You really just need to know that

The behaviour of the native pipe is by and large the same as that of the %>% pipe provided by the magrittr package. Both operators (|> and %>%) let you “pipe” an object forward to a function or call expression, thereby allowing you to express a sequence of operations that transform an object…there’s no need to commit entirely to one pipe or the other — you can use the base pipe for the majority of cases where it’s sufficient and use the magrittr pipe when you really need its special features - From the Tidyverse website

It’s fine to use either one. You will come across both of them in code that you see, so you should be aware.

  • You can change your RStudio app options to use |> instead of %>% if you like:
    • Tools -> Global Options -> Code -> select Use native pipe operator
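
As a small sketch, the same step can be written with either pipe. The tibble toy is made up for illustration, and the native pipe requires R 4.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data, not the smoke_complete dataset
toy <- tibble(
  id                 = 1:4,
  cigarettes_per_day = c(3, 0.5, 2, 1)
)

# magrittr pipe (available after loading dplyr or the tidyverse)
with_magrittr <- toy %>%
  arrange(cigarettes_per_day)

# native pipe (part of R itself)
with_native <- toy |>
  arrange(cigarettes_per_day)

# compare the two results
all.equal(with_magrittr, with_native)
```

For simple chains like this one, the two pipes can be swapped freely.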

6 Row wrangling

  • arrange()
  • filter()
  • &, |, !

6.1 Sorting data frames using arrange()

  • arrange() is a function that lets us sort a dataset by a specified variable

  • By default, it sorts in ascending order:

smoke_complete %>% 
  arrange(cigarettes_per_day)
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,142 more rows
  • To sort by descending order, you need to wrap the variable in the desc() function:
smoke_complete %>%
  arrange(desc(cigarettes_per_day))
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            17635 stage iv                 40    male   dead         BLCA   
 2            27708 stage ia                 13.2  male   dead         LUSC   
 3            27708 stage ia                 13.2  male   dead         LUSC   
 4            24477 stage ia                 11.0  male   dead         LUSC   
 5            24477 stage ia                 11.0  male   dead         LUSC   
 6            24713 stage iiia               10.5  male   dead         LUSC   
 7            24713 stage iiia               10.5  male   dead         LUSC   
 8            25646 stage ib                  9.86 male   alive        LUSC   
 9            25646 stage ib                  9.86 male   alive        LUSC   
10            25506 stage ib                  8.88 male   alive        LUSC   
# ℹ 1,142 more rows
  • You can also arrange by multiple variables.
smoke_complete %>%
  arrange(desc(cigarettes_per_day),
          tumor_stage)
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            17635 stage iv                 40    male   dead         BLCA   
 2            27708 stage ia                 13.2  male   dead         LUSC   
 3            27708 stage ia                 13.2  male   dead         LUSC   
 4            24477 stage ia                 11.0  male   dead         LUSC   
 5            24477 stage ia                 11.0  male   dead         LUSC   
 6            24713 stage iiia               10.5  male   dead         LUSC   
 7            24713 stage iiia               10.5  male   dead         LUSC   
 8            25646 stage ib                  9.86 male   alive        LUSC   
 9            25646 stage ib                  9.86 male   alive        LUSC   
10            25506 stage ib                  8.88 male   alive        LUSC   
# ℹ 1,142 more rows
  • Note that the order of variables in arrange() matters!
smoke_complete %>%
  arrange(
    tumor_stage,
    desc(cigarettes_per_day)
          )
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            25220 not reported               3.12 female alive        CESC   
 2            28873 not reported               3.07 female dead         CESC   
 3            18773 not reported               2.90 female alive        CESC   
 4            16231 not reported               2.85 female dead         CESC   
 5            24059 not reported               2.74 female dead         CESC   
 6            20302 not reported               2.74 female alive        CESC   
 7            28826 not reported               2.47 male   dead         LUSC   
 8            28826 not reported               2.47 male   dead         LUSC   
 9            21520 not reported               2.19 female dead         CESC   
10            20207 not reported               2.19 female alive        CESC   
# ℹ 1,142 more rows

6.2 (Mini) Challenge 2

Use arrange() to sort by disease and then by tumor_stage in descending order.

# edit the code below
smoke_complete %>%
  arrange()
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,142 more rows

6.3 filter()ing our data

  • filter() lets us subset our data according to specific criteria.
  • Let’s filter on the numeric variable cigarettes_per_day:
smoke_complete %>%
  filter(cigarettes_per_day < 20)
# A tibble: 1,151 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,141 more rows
  • We can also filter categorical variables.
    • This requires us to know the values (levels) of the categorical variable. More on this in a bit…
smoke_complete %>%
  filter(tumor_stage == "stage iv")
# A tibble: 91 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            21491 stage iv                 0.164 male   dead         BLCA   
 2            28082 stage iv                 0.219 male   alive        BLCA   
 3            24267 stage iv                 0.219 male   dead         BLCA   
 4            28470 stage iv                 0.274 male   alive        BLCA   
 5            23538 stage iv                 0.384 female alive        BLCA   
 6            22122 stage iv                 0.438 female alive        BLCA   
 7            30205 stage iv                 0.534 male   dead         BLCA   
 8            26893 stage iv                 0.548 male   dead         BLCA   
 9            27496 stage iv                 0.548 female dead         BLCA   
10            27397 stage iv                 0.548 female dead         BLCA   
# ℹ 81 more rows
  • We can also combine different filtering conditions:
smoke_complete %>%
  filter(tumor_stage == "stage iv", # note the comma
         cigarettes_per_day < 20)
# A tibble: 90 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            21491 stage iv                 0.164 male   dead         BLCA   
 2            28082 stage iv                 0.219 male   alive        BLCA   
 3            24267 stage iv                 0.219 male   dead         BLCA   
 4            28470 stage iv                 0.274 male   alive        BLCA   
 5            23538 stage iv                 0.384 female alive        BLCA   
 6            22122 stage iv                 0.438 female alive        BLCA   
 7            30205 stage iv                 0.534 male   dead         BLCA   
 8            26893 stage iv                 0.548 male   dead         BLCA   
 9            27496 stage iv                 0.548 female dead         BLCA   
10            27397 stage iv                 0.548 female dead         BLCA   
# ℹ 80 more rows
  • Using a comma requires BOTH conditions to be true.
    • In other words, in logic this is equivalent to using “and”. More on this next.

6.4 Filtering requires a little logic

  • We can chain multiple criteria using different operators:
    • , (AND)
    • & (AND)
    • | (OR)
  • But we need to review a little logic before we do this.
  • If we want to restrict to patients who were
    • male and stage iv,
    • we would use an & to chain these criteria together:
smoke_complete %>%
  filter(gender == "male" & 
           tumor_stage == "stage iv")
# A tibble: 75 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            21491 stage iv                 0.164 male   dead         BLCA   
 2            28082 stage iv                 0.219 male   alive        BLCA   
 3            24267 stage iv                 0.219 male   dead         BLCA   
 4            28470 stage iv                 0.274 male   alive        BLCA   
 5            30205 stage iv                 0.534 male   dead         BLCA   
 6            26893 stage iv                 0.548 male   dead         BLCA   
 7            30674 stage iv                 0.658 male   dead         BLCA   
 8            27049 stage iv                 0.767 male   alive        BLCA   
 9            21233 stage iv                 0.822 male   alive        BLCA   
10            20420 stage iv                 0.822 male   alive        BLCA   
# ℹ 65 more rows
  • Note that we could also use the comma in this case, as above:
smoke_complete %>%
  filter(gender == "male",
         tumor_stage == "stage iv")
# A tibble: 75 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            21491 stage iv                 0.164 male   dead         BLCA   
 2            28082 stage iv                 0.219 male   alive        BLCA   
 3            24267 stage iv                 0.219 male   dead         BLCA   
 4            28470 stage iv                 0.274 male   alive        BLCA   
 5            30205 stage iv                 0.534 male   dead         BLCA   
 6            26893 stage iv                 0.548 male   dead         BLCA   
 7            30674 stage iv                 0.658 male   dead         BLCA   
 8            27049 stage iv                 0.767 male   alive        BLCA   
 9            21233 stage iv                 0.822 male   alive        BLCA   
10            20420 stage iv                 0.822 male   alive        BLCA   
# ℹ 65 more rows
  • If we want to restrict to patients who were
    • male or stage iv,
    • we would use an | to chain these criteria together.
smoke_complete %>%
  filter(gender == "male" |
           tumor_stage == "stage iv")
# A tibble: 802 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii              0.00822 male   dead         BLCA   
 2            29288 stage iii              0.0548  male   dead         BLCA   
 3            18983 stage ii               0.0822  male   dead         BLCA   
 4            22632 stage iiia             0.110   male   dead         LUSC   
 5            22632 stage iiia             0.110   male   dead         LUSC   
 6            20632 stage iib              0.110   male   alive        LUSC   
 7            20632 stage iib              0.110   male   alive        LUSC   
 8            25579 stage iii              0.110   male   dead         BLCA   
 9            23156 stage i                0.137   male   dead         LUSC   
10            23156 stage i                0.137   male   dead         LUSC   
# ℹ 792 more rows

Think about it: which of the above two code blocks will return a larger number of patients?
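
The logic behind & and | can be seen on two small logical vectors. This is a sketch with made-up values; is_male and is_stage_iv are hypothetical names standing in for the two filter conditions.

```r
# One element per (hypothetical) patient
is_male     <- c(TRUE, TRUE, FALSE, FALSE)
is_stage_iv <- c(TRUE, FALSE, TRUE, FALSE)

# AND: TRUE only where both conditions are TRUE
both <- is_male & is_stage_iv
both          # TRUE FALSE FALSE FALSE

# OR: TRUE where at least one condition is TRUE
either <- is_male | is_stage_iv
either        # TRUE TRUE TRUE FALSE

# sum() counts the TRUE values, i.e. the rows filter() would keep
sum(both)     # 1
sum(either)   # 3
```

Every row that satisfies both conditions also satisfies at least one of them, so | can never keep fewer rows than &.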

6.5 != NOT EQUAL

  • We can also negate operators.
  • != means NOT EQUAL
smoke_complete %>%
  filter(gender != "male")
# A tibble: 366 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            19847 not reported             0.0110 female alive        CESC   
 2            14225 not reported             0.0219 female dead         CESC   
 3            31258 not reported             0.0312 female alive        CESC   
 4            27449 stage ii                 0.0548 female dead         LUSC   
 5            27449 stage ii                 0.0548 female dead         LUSC   
 6            16429 stage iii                0.0548 female dead         LUSC   
 7            16429 stage iii                0.0548 female dead         LUSC   
 8            15965 not reported             0.0548 female alive        CESC   
 9            17465 not reported             0.0548 female dead         CESC   
10            15849 not reported             0.0548 female alive        CESC   
# ℹ 356 more rows
  • Or, we can negate a whole statement:
smoke_complete %>%
  filter(!(gender == "male"))
# A tibble: 366 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            19847 not reported             0.0110 female alive        CESC   
 2            14225 not reported             0.0219 female dead         CESC   
 3            31258 not reported             0.0312 female alive        CESC   
 4            27449 stage ii                 0.0548 female dead         LUSC   
 5            27449 stage ii                 0.0548 female dead         LUSC   
 6            16429 stage iii                0.0548 female dead         LUSC   
 7            16429 stage iii                0.0548 female dead         LUSC   
 8            15965 not reported             0.0548 female alive        CESC   
 9            17465 not reported             0.0548 female dead         CESC   
10            15849 not reported             0.0548 female alive        CESC   
# ℹ 356 more rows
  • We can get a little complex.
    • What does the code below do?
    • What do you think %in% does?
smoke_complete %>%
  filter(tumor_stage!="not reported",
         disease %in% c("CESC", "LUSC"))
# A tibble: 830 × 6
   age_at_diagnosis tumor_stage cigarettes_per_day gender vital_status disease
              <dbl> <chr>                    <dbl> <chr>  <chr>        <chr>  
 1            27449 stage ii                0.0548 female dead         LUSC   
 2            27449 stage ii                0.0548 female dead         LUSC   
 3            16429 stage iii               0.0548 female dead         LUSC   
 4            16429 stage iii               0.0548 female dead         LUSC   
 5            22632 stage iiia              0.110  male   dead         LUSC   
 6            22632 stage iiia              0.110  male   dead         LUSC   
 7            20632 stage iib               0.110  male   alive        LUSC   
 8            20632 stage iib               0.110  male   alive        LUSC   
 9            23156 stage i                 0.137  male   dead         LUSC   
10            23156 stage i                 0.137  male   dead         LUSC   
# ℹ 820 more rows
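
If you want to check your answer about %in%, here is a minimal sketch on a made-up character vector. The vector disease and the object names below are for illustration only.

```r
# Hypothetical values, not the smoke_complete dataset
disease <- c("CESC", "LUSC", "BLCA", "LUSC")

# %in% asks, for each element on the left,
# whether it matches any value on the right
in_set <- disease %in% c("CESC", "LUSC")
in_set    # TRUE TRUE FALSE TRUE

# The same result written with == and |
with_or <- disease == "CESC" | disease == "LUSC"
with_or   # TRUE TRUE FALSE TRUE
```

With only two values either form works, but %in% stays readable as the list of values grows.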

6.6 More filter examples

See this BERD workshop slide

6.7 How to check what values exist in a categorical variable

  • We’ve already seen that both skim() and glimpse() can give us an idea of what values exist in a categorical variable.
  • We can use distinct() to grab all of the unique values for a categorical variable,
    • and then use arrange() to sort them.
smoke_complete %>%
  distinct(tumor_stage) %>%
  arrange(tumor_stage)
# A tibble: 11 × 1
   tumor_stage 
   <chr>       
 1 not reported
 2 stage i     
 3 stage ia    
 4 stage ib    
 5 stage ii    
 6 stage iia   
 7 stage iib   
 8 stage iii   
 9 stage iiia  
10 stage iiib  
11 stage iv    
  • This can also alert us if there are categorical coding mistakes (such as misspellings) in our data.

  • Another tidy way to do this is to use the tabyl() function from the janitor package:

# library(janitor) # already loaded above; note that tabyl() comes from the janitor package

smoke_complete %>%
  tabyl(tumor_stage)
  tumor_stage   n     percent
 not reported  99 0.085937500
      stage i   7 0.006076389
     stage ia 146 0.126736111
     stage ib 266 0.230902778
     stage ii  65 0.056423611
    stage iia 112 0.097222222
    stage iib 148 0.128472222
    stage iii  86 0.074652778
   stage iiia 102 0.088541667
   stage iiib  30 0.026041667
     stage iv  91 0.078993056
  • We’ll learn about more tabyl() options later in the term.
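
If the janitor package is not available, dplyr's count() gives similar frequencies (without the percent column). This is a sketch on a made-up tibble, offered as an alternative rather than as the approach used in this course.

```r
library(dplyr)

# Hypothetical toy data, not the smoke_complete dataset
toy <- tibble(
  tumor_stage = c("stage i", "stage iv", "stage i", "not reported")
)

# One row per distinct value, with the number of rows in column n
stage_counts <- toy %>%
  count(tumor_stage)

stage_counts
```

As with distinct(), a misspelled category would show up here as its own row.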

6.8 (Mini) Challenge 3

  • Use filter() to restrict to patients from smoke_complete who have
    • disease == “LUSC” and
    • who smoke fewer than 1 cigarette per day (cigarettes_per_day).
  • How many rows are left?
# edit this code
smoke_complete %>%
  filter()
# A tibble: 1,152 × 6
   age_at_diagnosis tumor_stage  cigarettes_per_day gender vital_status disease
              <dbl> <chr>                     <dbl> <chr>  <chr>        <chr>  
 1            18051 stage iii               0.00822 male   dead         BLCA   
 2            19847 not reported            0.0110  female alive        CESC   
 3            14225 not reported            0.0219  female dead         CESC   
 4            31258 not reported            0.0312  female alive        CESC   
 5            27449 stage ii                0.0548  female dead         LUSC   
 6            27449 stage ii                0.0548  female dead         LUSC   
 7            16429 stage iii               0.0548  female dead         LUSC   
 8            16429 stage iii               0.0548  female dead         LUSC   
 9            15965 not reported            0.0548  female alive        CESC   
10            17465 not reported            0.0548  female dead         CESC   
# ℹ 1,142 more rows

6.9 More about comparison and logical operators
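
As a quick reference, here is a sketch of the comparison and logical operators used in this section, applied to a small made-up vector. The vector x is for illustration only and includes a missing value on purpose.

```r
# Hypothetical values, including one missing value
x <- c(0.5, 1, 2, NA)

x < 1      # TRUE  FALSE FALSE NA   (less than)
x <= 1     # TRUE  TRUE  FALSE NA   (less than or equal to)
x == 1     # FALSE TRUE  FALSE NA   (equal to; note the double =)
x != 1     # TRUE  FALSE TRUE  NA   (not equal to)
!(x > 1)   # TRUE  TRUE  FALSE NA   (negating a whole comparison)

# Combining comparisons
in_range <- x > 0.5 & x < 2
in_range   # FALSE TRUE FALSE NA

# Comparisons with NA give NA, so use is.na() to find missing values
missing <- is.na(x)
missing    # FALSE FALSE FALSE TRUE
```

filter() keeps only the rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped.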

7 Column wrangling

  • select()
  • tidyselect helpers

7.1 Selecting columns using select()

  • select() allows us to select variables from our dataset
smoke_complete %>%
  select(gender, tumor_stage) %>% 
  names()
[1] "gender"      "tumor_stage"
  • We can also select a range of consecutive columns with the : operator:
smoke_complete %>%
  select(gender:disease) %>% 
  names()
[1] "gender"       "vital_status" "disease"     
  • To select everything except some variables,
    • use a - or ! in front of the variables.
smoke_complete %>%
  select(-gender) %>% 
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "vital_status"       "disease"           
smoke_complete %>%
  select(-gender, -disease) %>% 
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "vital_status"      
# with !
smoke_complete %>%
  select(!gender) %>% 
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "vital_status"       "disease"           
smoke_complete %>%
  select(!c(gender, disease)) %>% # note the c()
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "vital_status"      

7.2 tidyselect helpers

  • There are ways to select column names by “searching” them.
  • These are called the tidyselect helpers. You can see examples here.
  • For instance, we can select all columns whose names include the word “day”. This is not very useful for smoke_complete, which has only one such column, but it would be for the brca_clinical data from last week:
smoke_complete %>% 
  select(contains("day"))
# A tibble: 1,152 × 1
   cigarettes_per_day
                <dbl>
 1            0.00822
 2            0.0110 
 3            0.0219 
 4            0.0312 
 5            0.0548 
 6            0.0548 
 7            0.0548 
 8            0.0548 
 9            0.0548 
10            0.0548 
# ℹ 1,142 more rows
smoke_complete %>% 
  select(starts_with("v"))
# A tibble: 1,152 × 1
   vital_status
   <chr>       
 1 dead        
 2 alive       
 3 dead        
 4 alive       
 5 dead        
 6 dead        
 7 dead        
 8 dead        
 9 alive       
10 dead        
# ℹ 1,142 more rows
smoke_complete %>% 
  select(ends_with("s"))
# A tibble: 1,152 × 2
   age_at_diagnosis vital_status
              <dbl> <chr>       
 1            18051 dead        
 2            19847 alive       
 3            14225 dead        
 4            31258 alive       
 5            27449 dead        
 6            27449 dead        
 7            16429 dead        
 8            16429 dead        
 9            15965 alive       
10            17465 dead        
# ℹ 1,142 more rows

Just the last column:

smoke_complete %>% 
  select(last_col())
# A tibble: 1,152 × 1
   disease
   <chr>  
 1 BLCA   
 2 CESC   
 3 CESC   
 4 CESC   
 5 LUSC   
 6 LUSC   
 7 LUSC   
 8 LUSC   
 9 CESC   
10 CESC   
# ℹ 1,142 more rows

Selecting columns by position is a useful shortcut, but use it with caution: you need to know your column names and their order.

smoke_complete %>% select(1:3, 5)
# A tibble: 1,152 × 4
   age_at_diagnosis tumor_stage  cigarettes_per_day vital_status
              <dbl> <chr>                     <dbl> <chr>       
 1            18051 stage iii               0.00822 dead        
 2            19847 not reported            0.0110  alive       
 3            14225 not reported            0.0219  dead        
 4            31258 not reported            0.0312  alive       
 5            27449 stage ii                0.0548  dead        
 6            27449 stage ii                0.0548  dead        
 7            16429 stage iii               0.0548  dead        
 8            16429 stage iii               0.0548  dead        
 9            15965 not reported            0.0548  alive       
10            17465 not reported            0.0548  dead        
# ℹ 1,142 more rows
  • You might also have a character vector of column names that you want to pull out.
  • Let’s say you have one saved:
mynames <- c("disease", "tumor_stage", "vital_status")

smoke_complete %>% select(any_of(mynames))
# A tibble: 1,152 × 3
   disease tumor_stage  vital_status
   <chr>   <chr>        <chr>       
 1 BLCA    stage iii    dead        
 2 CESC    not reported alive       
 3 CESC    not reported dead        
 4 CESC    not reported alive       
 5 LUSC    stage ii     dead        
 6 LUSC    stage ii     dead        
 7 LUSC    stage iii    dead        
 8 LUSC    stage iii    dead        
 9 CESC    not reported alive       
10 CESC    not reported dead        
# ℹ 1,142 more rows
  • There is a similar function, all_of(), which throws an error if any of the column names is missing:
smoke_complete %>% select(all_of(mynames))

mynames2 <- c("id", mynames)
smoke_complete %>% select(any_of(mynames2))
smoke_complete %>% select(all_of(mynames2))  # error
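
The difference between any_of() and all_of() can be sketched on a small made-up tibble. toy_smoke and wanted are hypothetical names used only for illustration, and tryCatch() is added here just so the error does not stop the script.

```r
library(dplyr)

# Hypothetical toy data with no "id" column
toy_smoke <- tibble(
  disease      = c("LUSC", "CESC"),
  tumor_stage  = c("stage ii", "not reported"),
  vital_status = c("dead", "alive")
)

wanted <- c("id", "disease", "vital_status")

# any_of() silently skips names that are not in the data
lenient <- toy_smoke %>% select(any_of(wanted))
names(lenient)

# all_of() is strict: a missing name is an error.
# tryCatch() captures the error so the rest of the script keeps running.
strict <- tryCatch(
  toy_smoke %>% select(all_of(wanted)),
  error = function(e) "all_of() failed: a column is missing"
)
strict
```

A reasonable rule of thumb: use all_of() when a missing column should stop your analysis, and any_of() when the column is optional.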

7.3 Rearranging the order of columns

7.3.1 With select() and everything()

  • everything() is useful for some quick rearranging of columns
  • Below we move tumor_stage and disease to the beginning of the dataset:
smoke_complete %>% names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "gender"             "vital_status"       "disease"           
smoke_complete %>% 
  select(tumor_stage, disease, everything()) %>% 
  names()
[1] "tumor_stage"        "disease"            "age_at_diagnosis"  
[4] "cigarettes_per_day" "gender"             "vital_status"      

7.3.2 With relocate()

smoke_complete %>% 
  relocate(tumor_stage, disease, everything()) %>% 
  names()
[1] "tumor_stage"        "disease"            "age_at_diagnosis"  
[4] "cigarettes_per_day" "gender"             "vital_status"      

7.3.3 .after and .before

smoke_complete %>% 
  relocate(tumor_stage, .before = age_at_diagnosis) %>% 
  names()
[1] "tumor_stage"        "age_at_diagnosis"   "cigarettes_per_day"
[4] "gender"             "vital_status"       "disease"           
smoke_complete %>% 
  relocate(disease, .after = vital_status) %>% 
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "gender"             "vital_status"       "disease"           
smoke_complete %>% 
  relocate(disease, .after = last_col()) %>% 
  names()
[1] "age_at_diagnosis"   "tumor_stage"        "cigarettes_per_day"
[4] "gender"             "vital_status"       "disease"           
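
In the examples above, disease already sits after vital_status and in the last position, so the last two calls leave the column order unchanged. As a small sketch of .after actually moving a column, here is the same idea on a made-up tibble (toy_smoke is hypothetical; its column names mirror smoke_complete):

```r
library(dplyr)

# Hypothetical toy data with four columns
toy_smoke <- tibble(
  age_at_diagnosis = c(18051, 19847),
  tumor_stage      = c("stage iii", "not reported"),
  gender           = c("male", "female"),
  disease          = c("BLCA", "CESC")
)

# Move disease so it sits right after age_at_diagnosis
moved_after <- toy_smoke %>%
  relocate(disease, .after = age_at_diagnosis)
names(moved_after)

# Move the first column to the very end
moved_last <- toy_smoke %>%
  relocate(age_at_diagnosis, .after = last_col())
names(moved_last)
```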

7.4 More tidyselect examples

See some more examples in this slide

For more information about tidyselect, run this code in your console and work through just the first 2 sections of the tutorial:

# install remotes package
install.packages("remotes")
# use remotes to install this package from github
remotes::install_github("laderast/tidyowl")

# load tidyowl package
library(tidyowl)

# interactive tutorial
tidyowl::learn_tidyselect()

7.5 The difference between filter() and select()

  • One thing to keep in mind is that:
    • filter() works on rows (think FILTER in Excel!), and
    • select() works on columns (select your relevant variables)

Keep that in mind!
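
One way to see the difference is to compare dimensions before and after each verb. This is a minimal sketch on a made-up tibble (toy_smoke is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: 3 rows, 3 columns
toy_smoke <- tibble(
  gender             = c("male", "female", "female"),
  cigarettes_per_day = c(0.5, 2, 4),
  disease            = c("BLCA", "CESC", "LUSC")
)

dim(toy_smoke)

# filter() keeps a subset of rows; the number of columns stays the same
fewer_rows <- toy_smoke %>%
  filter(cigarettes_per_day > 1)
dim(fewer_rows)

# select() keeps a subset of columns; the number of rows stays the same
fewer_cols <- toy_smoke %>%
  select(gender, disease)
dim(fewer_cols)
```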

7.6 Saving our results

  • Let’s save our processed data into the data/ folder.
  • We’ll save it as a csv file, which is short for comma-separated values.
    • This is a file type that can be easily imported into Excel.
processed_data <- smoke_complete %>%
  select(-gender) %>%
  filter(cigarettes_per_day < 20)

write_excel_csv(
  x = processed_data,
  file = here("part3", "data","processed_data.csv"))
  • If you want to save it as an Excel file, you can use a function in the writexl package:
writexl::write_xlsx(
  x = processed_data,
  path = here("part3", "data","processed_data.xlsx"))
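
A quick way to check that a file was saved correctly is to read it back in. The sketch below does this with a made-up tibble and a temporary file; toy_processed and csv_path are hypothetical, and tempfile() stands in for a path you would normally build with here().

```r
library(dplyr)
library(readr)

# Hypothetical toy data standing in for processed_data
toy_processed <- tibble(
  tumor_stage        = c("stage ii", "stage iii"),
  cigarettes_per_day = c(0.05, 1.5)
)

# Assumption: a temporary file is used so nothing is written to your project
csv_path <- tempfile(fileext = ".csv")

write_excel_csv(x = toy_processed, file = csv_path)

# Read the file back in and compare it to what we saved
reloaded <- read_csv(csv_path, show_col_types = FALSE)
names(reloaded)
nrow(reloaded)
```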

7.7 Challenge 4 (10+ minutes)

Go back to your smoke_messy.qmd file.

Using the messy smoke_messy data, write code to do these data cleaning steps:

  1. Use clean_names() from the janitor package to clean up the column names. Read the help documentation if you’ve never seen this.
  2. Use remove_empty() from the janitor package to remove empty columns and rows.
  3. Use fill() to fill in tumor stage values. (Read the example code in ?fill to see how to use it; you do need to specify arguments here.)
  4. Remove the “Notes” column.
  5. Save the resulting data frame as a .csv file using write_excel_csv() in the data/ folder. Remember to use here().
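
If fill() is new to you, here is a minimal sketch of what it does on a made-up tibble. toy_messy is hypothetical and much simpler than smoke_messy, and filling downward is an assumption for illustration; check which direction makes sense for the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the stage is recorded only on the first row of each block
toy_messy <- tibble(
  tumor_stage      = c("stage i", NA, NA, "stage ii", NA),
  age_at_diagnosis = c(20000, 21000, 22000, 23000, 24000)
)

# Carry the last non-missing value downward into the NA cells
toy_filled <- toy_messy %>%
  fill(tumor_stage, .direction = "down")

toy_filled$tumor_stage
```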

7.8 Further Reading about dplyr

Please refer to this week’s readings for more reading about dplyr.

8 Customizing ggplots

8.1 Customizing a Scatterplot

Now that we have the data formatted nicely, we can start creating and customizing a figure.

our_plot <- ggplot(
  smoke_complete,
  aes(x = age_at_diagnosis, 
      y = cigarettes_per_day, 
      color = disease)) +
  geom_point() +
  labs(
    title = "Cigarettes per Day versus Age at Diagnosis",
    x = "Age at Diagnosis",
    y = "Cigarettes Smoked per Day"
    )
our_plot

8.2 Changing visual properties using built in themes

  • Adding (layering) a theme will change the settings of many visual aspects of the plot.
  • There are many different themes.
  • A nice simple one is theme_minimal() which comes with the ggplot2 package:
our_plot +   # our_plot created above
  theme_minimal()  # layer a theme onto our_plot

8.3 More themes

  • A complete list of themes included with ggplot2 is available here, and
    • we’ll cover ways to customize our own themes later in this lesson.
  • Other packages have additional themes, such as ggthemes:
our_plot + ggthemes::theme_clean()

our_plot + ggthemes::theme_economist()

our_plot + ggthemes::theme_excel_new()

8.4 Using theme() to customize

  • Adding the theme() function lets us customize our plot further.
  • Most of these settings are set within one of the built in themes,
    • so if you want to overwrite what is in the theme,
    • you’ll need to set these settings after setting the theme, by adding them in order.

There are a few arguments that are really helpful to modify:

  • axis.title
  • axis.title.x (label for the x-axis)
  • axis.title.y (label for the y-axis)
  • legend.position (Placing the legend, including removing it)

and many more. You can read this (in progress) chapter in the ggplot2 book about themes to see more examples.

our_plot + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

our_plot + 
  theme_bw() + 
  theme(
    # change legend position
    legend.position = "bottom"
  )

our_plot + 
  theme_bw() +
  theme(
    # remove legends
    legend.position = "none",
    # also change font of various elements
    plot.title = element_text(face = "bold", size = 12))
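
To see why the order of the layers matters, the sketch below builds the same plot twice on made-up data (toy_smoke and toy_plot are hypothetical). Only the order of theme_bw() and theme() differs.

```r
library(ggplot2)

# Hypothetical toy data; column names mirror smoke_complete
toy_smoke <- data.frame(
  age_at_diagnosis   = c(18051, 19847, 14225, 27449),
  cigarettes_per_day = c(0.5, 2, 1, 3),
  disease            = c("BLCA", "CESC", "CESC", "LUSC")
)

toy_plot <- ggplot(toy_smoke,
                   aes(x = age_at_diagnosis,
                       y = cigarettes_per_day,
                       color = disease)) +
  geom_point()

# theme() added AFTER the built-in theme: the legend moves to the bottom
legend_kept <- toy_plot +
  theme_bw() +
  theme(legend.position = "bottom")

# theme() added BEFORE the built-in theme: theme_bw() is a complete theme,
# so it overwrites the earlier legend.position setting
legend_reset <- toy_plot +
  theme(legend.position = "bottom") +
  theme_bw()

legend_kept
legend_reset
```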

8.5 Saving your work

  • After you’re satisfied with a plot, it’s likely you’d want to share it with other people or include it in a manuscript or report.
# save plot to file
ggsave("awesomePlot.jpg", width = 10, height = 10, dpi = 300)
  • By default, ggsave() saves the last plot that was displayed. Because only a file name is given here, the file is written to your current working directory.

  • This command interprets the file format for export using the file suffix you specify.

  • The other arguments dictate the size (width and height) and resolution (dpi).

  • You can specify a different directory!

ggsave(here("part3", "image","awesomePlot.jpg"), width = 10, height = 10, dpi = 300)
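
Relying on "the last plot" can be fragile if you run code out of order. As a hedged sketch, you can instead pass a saved plot object to the plot argument of ggsave(). Here toy_plot and plot_path are hypothetical, and the temporary folder and the .pdf suffix are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy plot saved as an object
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)),
                   aes(x = x, y = y)) +
  geom_point()

# Assumption: a temporary folder is used instead of a project path built with here()
plot_path <- file.path(tempdir(), "toy_plot.pdf")

# plot = says exactly which plot to save; the .pdf suffix sets the file format
ggsave(filename = plot_path, plot = toy_plot, width = 5, height = 4)

file.exists(plot_path)
```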

8.6 Boxplots

  • Boxplots compare the distribution of a quantitative variable among categories.
  • Remember, vital_status is a character vector, but we’re not too worried about the implicit order of the categories, so we can use it as is in our boxplot.
# creating a boxplot

ggplot(smoke_complete,
       aes(x = vital_status, 
           y = cigarettes_per_day)) +
  geom_boxplot() 

  • The main differences from the scatterplots we created earlier are the geom type and the variables plotted.

8.7 Change colors

  • We can change the color of boxplots similarly to scatterplots.
  • However, we map to fill and not color:

Challenge: change fill to color and/or add a mapping to color.

# add color
ggplot(smoke_complete,
       aes(x = vital_status, 
           y = cigarettes_per_day, 
           fill = vital_status)) +
  geom_boxplot() 

8.8 fill or color scales

  • Similar to themes, there are built in color palettes you can use to change the colors in your plot.
  • You can also do this manually.

Here is the default fill scale:

# add color
ggplot(smoke_complete,
       aes(x = tumor_stage, 
           y = cigarettes_per_day, 
           fill = tumor_stage)) +
  geom_boxplot() 

  • The functions that change the fill start with scale_ such as this colorblind friendly scale in the ggthemes package.
  • However, this palette has fewer colors than tumor_stage has categories, so it runs out of colors! Be careful with palettes like this.
# adding color
ggplot(smoke_complete,
       aes(x = tumor_stage, 
           y = cigarettes_per_day, 
           fill = tumor_stage)) +
  geom_boxplot() +
  ggthemes::scale_fill_colorblind()  # from the ggthemes package

  • Another useful colorblind-friendly palette is the viridis palette, which is built into ggplot2.
  • Read more about it, and the other built-in palettes, in the viridis package vignette.
# adding color
ggplot(smoke_complete,
       aes(x = tumor_stage, 
           y = cigarettes_per_day, 
           fill = tumor_stage)) +
  geom_boxplot() +
  scale_fill_viridis_d()

  • There are several viridis palettes to choose from, selected with the option argument:
# adding color
ggplot(smoke_complete,
       aes(x = tumor_stage, 
           y = cigarettes_per_day, 
           fill = tumor_stage)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "A")

  • We can also change fills/colors manually, such as:
# adding color
ggplot(smoke_complete,
       aes(x = tumor_stage, 
           y = cigarettes_per_day, 
           fill = gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("purple","lightgreen"))
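
When colors are given as an unnamed vector, they are matched to the categories in order. A slightly safer variant, sketched below on made-up data, is to name each color by its category so the mapping is explicit. toy_smoke, gender_colors, and toy_box are hypothetical names.

```r
library(ggplot2)

# Hypothetical toy data
toy_smoke <- data.frame(
  gender             = rep(c("male", "female"), each = 4),
  cigarettes_per_day = c(1, 2, 3, 4, 0.5, 1.5, 2.5, 3.5)
)

# A named vector ties each color to a specific category
gender_colors <- c(female = "purple", male = "lightgreen")

toy_box <- ggplot(toy_smoke,
                  aes(x = gender,
                      y = cigarettes_per_day,
                      fill = gender)) +
  geom_boxplot() +
  scale_fill_manual(values = gender_colors)

toy_box
```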

8.9 Faceting our boxplot

  • One of the most powerful ways to change a visualization is by faceting.
  • We can make multiple plots using another categorical variable.
  • To do this, we have to add the facet_wrap() command to our plot.
    • We wrap the faceting variable in vars() so that facet_wrap() treats it as a column of the data.
# faceting by disease
ggplot(smoke_complete,
       aes(x = vital_status, 
           y = cigarettes_per_day, 
           fill = vital_status)) +
  geom_boxplot() +
  ylim(c(0,20)) + # just to see the boxes better
  facet_wrap(vars(disease))

  • Don’t try to facet on a continuous numeric variable: ggplot2 would make a separate panel for every unique value, which is rarely what you want.

  • Don’t forget to look at the help documentation (e.g., ?facet_wrap) to learn more about additional ways to customize your plots!
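
As one example of what the help page offers, facet_wrap() has ncol and scales arguments. The sketch below uses made-up data (toy_smoke and toy_facets are hypothetical) to stack the panels in one column and give each panel its own y-axis range.

```r
library(ggplot2)

# Hypothetical toy data: three diseases with very different smoking levels
toy_smoke <- data.frame(
  vital_status       = rep(c("alive", "dead"), times = 6),
  cigarettes_per_day = c(1, 2, 3, 4, 2, 3, 10, 12, 14, 11, 13, 15),
  disease            = rep(c("BLCA", "CESC", "LUSC"), each = 4)
)

toy_facets <- ggplot(toy_smoke,
                     aes(x = vital_status,
                         y = cigarettes_per_day,
                         fill = vital_status)) +
  geom_boxplot() +
  # ncol = 1 stacks the panels; scales = "free_y" lets each panel
  # choose its own y-axis range
  facet_wrap(vars(disease), ncol = 1, scales = "free_y")

toy_facets
```

Free scales make each panel easier to read, but they also make it harder to compare values across panels, so use them deliberately.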

8.10 (Mini) Challenge 5

  • Facet the boxplot by gender. Don’t forget vars()
# update code
# change to eval: true once completed

ggplot(smoke_complete,
       aes(x = vital_status, 
           y = cigarettes_per_day, 
           fill = vital_status)) +
  geom_boxplot() +
  ylim(c(0,20)) +
  facet_wrap()

8.11 facet_grid()

  • Another way to facet our plots is with facet_grid(),
    • which lets us select both rows and columns based on categorical variables:
ggplot(smoke_complete,
       aes(x = vital_status, 
           y = cigarettes_per_day, 
           fill = vital_status)) +
  geom_boxplot() +
  ylim(c(0,10)) +
  facet_grid(rows = vars(disease), 
             cols = vars(gender))

8.12 Challenge 6 (10 minutes)

  1. Create a boxplot of age at diagnosis (in years) stratified by tumor stage. Make the y-axis tumor stage.
  2. Map fill to tumor stage.
  3. Use the minimal theme.
  4. Use facet grid to facet on disease (rows) and vital status (cols).
  5. Set arguments in facet_grid to have scales and space set to “free_y”.
  6. Change axis labels to look nicer, and change fill scale to another scale.
# update code
# change to eval: true once completed

ggplot(smoke_complete,
       aes()) +

8.13 Further ggplot learning

If you are interested in learning more about ggplot:

  • Documentation for all ggplot features is available here.
  • RStudio also publishes a ggplot cheat sheet that is really handy!
    • Note: this is for version 3 of ggplot
  • Customizing ggplot2 Cheatsheet is also handy, because it organizes ggplot2 commands by task.
    • Note: this is for version 2 of ggplot
  • New ggplot options with the latest version 4 released in September 2025

9 What you learned today

  • Pipes (%>%)
  • arrange()
  • filter()
  • select()
  • Customizing ggplots using theme()
  • Making boxplots
  • Intro to using scales in ggplot2 (more to come)
  • Faceting plots

10 Post Class Survey

Please fill out the post-class survey.

Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.

You may want to review previous years’ feedback here.

11 Acknowledgements

  • Part 3 is based on the BSTA 504 Winter 2023 course, taught by Jessica Minnier.
    • I made modifications to update the material from RMarkdown to Quarto, and streamlined/edited content for slides.
    • Added: relocate, ! for not selecting columns, ggplot v. 4 resource
  • Minnier’s Acknowledgements:
    • This notebook was adapted from material from Kate Hertweck and https://fredhutch.io and from the R-Bootcamp by Ted Laderas and Jessica Minnier, as well as the “Data Wrangling in R with the Tidyverse” and “Data Visualization with R and ggplot2” OCTRI-BERD workshops by Jessica & Meike Niederhausen.